1 Quick Quality Control

Here we take a quick first look at the data from the provided dataset.

library(scater, quietly = TRUE)
library(knitr)
## options

opts_chunk$set(fig.align = 'center', fig.width = 10, fig.height = 5, dev = 'png',
               warning = FALSE, message = FALSE)
library(ggplot2)
library(viridis)
library(ggthemes)
library(cowplot)
library(magrittr)
library(destiny)
## load data
load(params$rdata_file)

1.1 Calculate QC metrics

To help with the QC of the dataset, scater enables straight-forward calculation of many QC metrics with the calculateQCMetrics function. To help produce useful metrics, it is good to identify gene and cell controls, if present. In this case, we use ERCC spike-ins and mitochondrial genes as gene controls (called “feature_controls” in `scater).

But first, filter out genes with no expression at all - time-wasters.

nonzero_genes <- (rowSums(counts(sce_gene)) > 0)
sum(nonzero_genes)
sce_gene_filt <- sce_gene[nonzero_genes,]

This leaves us with 36376 genes and 707 cells.

However, I will set a low threshold to call an observation “expressed” - any count above zero suffices. Below is a summary of the percentage of counts coming from ERCC spike-in controls and mitochondrial (MT) genes.

count_thresh <- 0
is_exprs(sce_gene_filt) <- (counts(sce_gene_filt) > count_thresh)
ercc_genes <- grepl("^ERCC", featureNames(sce_gene_filt))
mt_genes <- grepl("^MT-", fData(sce_gene_filt)$hgnc_symbol)
sce_gene_filt <- calculateQCMetrics(sce_gene_filt, 
    feature_controls = list(ERCC = ercc_genes, MT = mt_genes))
stats <- summary(sce_gene_filt$pct_counts_feature_controls_ERCC)
kable(data.frame("Stat" = names(stats), "Pct Counts from ERCC" = as.vector(stats)))
Stat Pct.Counts.from.ERCC
Min. 0.00000
1st Qu. 0.06476
Median 0.23270
Mean 2.02900
3rd Qu. 0.43440
Max. 92.96000
stats_MT <- summary(sce_gene_filt$pct_counts_feature_controls_MT)
kable(data.frame("Stat" = names(stats_MT), "Pct Counts from MT" = as.vector(stats_MT)))
Stat Pct.Counts.from.MT
Min. 0.000
1st Qu. 2.777
Median 4.218
Mean 8.221
3rd Qu. 5.740
Max. 80.410

1.2 Expressed genes vs total counts

Plot number of expressed features against total counts, with cells coloured by different QC metrics:

p1 <- plotPhenoData(sce_gene_filt, 
              aes(x = log10_total_counts, y = total_features, 
                  colour =  filter_on_total_features)) +
    theme(legend.position = "bottom") +
    geom_vline(xintercept = 5, linetype = 2)
p2 <- plotPhenoData(sce_gene_filt, 
              aes(x = log10_total_counts, y = total_features, 
                  colour =  pct_counts_top_100_features)) +
    theme(legend.position = "bottom") +
    geom_vline(xintercept = 5, linetype = 2)
p3 <- plotPhenoData(sce_gene_filt, 
              aes(x = log10_total_counts, y = total_features, 
                  colour =  pct_counts_feature_controls_ERCC)) +
    theme(legend.position = "bottom") +
    geom_vline(xintercept = 5, linetype = 2)
plot_grid(p1, p2, p3, labels = c("A", "B", "C"), 
          nrow = 1)

1.3 % Dropout vs % Counts from ERCC spike-ins

Look at percentage of genes not expressed (percentage dropout) for each cell plotted against the percentage of counts obtained from ERCC spike-in controls.

1.4 Cumulative Expression Plots

The cumulative expression plot (with cells coloured by percentage of counts coming from feature controls, i.e. ERCC and MT genes) shows a handful of low complexity libraries and otherwise a range of library complexities associated with the percentage of expression accounted for by feature controls..

p1 <- plot(sce_gene_filt, exprs_values = "counts",
           colour_by = "pct_exprs_feature_controls_ERCC") +
    theme(legend.position = "bottom")
p2 <- plot(sce_gene_filt, exprs_values = "counts",
           colour_by = "pct_exprs_feature_controls_MT") +
    theme(legend.position = "bottom")
plot_grid(p1, p2, labels = c("A", "B"), nrow = 1)

1.5 Most expressed genes

Plot the most expressed genes across each dataset. Surprisingly, the ERCC spike-ins do not feature in the list of most-expressed genes. Many MT and ribosomal genes appear, along with GAPDH, as expected. The most expressed gene is the Y chromosome gene AC010970.2.

plotQC(sce_gene_filt)

1.6 Gene expression frequency against mean

Plot expression frequency against mean expression. This highlights generally decreasing dropout with increasing mean expression.

1.7 PCA on QC metrics with outlier detection

Another option available in scater is to conduct PCA on a set of QC metrics. The advantage of doing this is that the QC metrics focus on technical aspects of the libraries that are likely to distinguish problematics cells. Automatic outlier detection on PCA plots using QC metrics is available to help identify potentially problematic cells.

We use the following metrics for PCA-based outlier detection:

  • pct_counts_top_100_features
  • total_features
  • pct_counts_feature_controls_MT
  • pct_counts_feature_controls_ERCC
  • n_detected_feature_controls
  • log10_counts_endogenous_features
  • log10_counts_feature_controls

A particular set of variables to be used can be specified with the selected_variables argument as shown in the example below.

plotPCA(sce_gene_filt, size_by = "total_features", 
        shape_by = "filter_on_total_features", 
        pca_data_input = "pdata", detect_outliers = TRUE)
## The following cells/samples are detected as outliers:
## quant_kallisto/19776_1#1
## quant_kallisto/19776_1#10
## quant_kallisto/19776_1#101
## quant_kallisto/19776_1#107
## quant_kallisto/19776_1#108
## quant_kallisto/19776_1#112
## quant_kallisto/19776_1#114
## quant_kallisto/19776_1#12
## quant_kallisto/19776_1#121
## quant_kallisto/19776_1#124
## quant_kallisto/19776_1#131
## quant_kallisto/19776_1#133
## quant_kallisto/19776_1#144
## quant_kallisto/19776_1#147
## quant_kallisto/19776_1#149
## quant_kallisto/19776_1#153
## quant_kallisto/19776_1#155
## quant_kallisto/19776_1#159
## quant_kallisto/19776_1#16
## quant_kallisto/19776_1#161
## quant_kallisto/19776_1#162
## quant_kallisto/19776_1#169
## quant_kallisto/19776_1#171
## quant_kallisto/19776_1#172
## quant_kallisto/19776_1#174
## quant_kallisto/19776_1#175
## quant_kallisto/19776_1#177
## quant_kallisto/19776_1#181
## quant_kallisto/19776_1#187
## quant_kallisto/19776_1#189
## quant_kallisto/19776_1#197
## quant_kallisto/19776_1#2
## quant_kallisto/19776_1#200
## quant_kallisto/19776_1#206
## quant_kallisto/19776_1#209
## quant_kallisto/19776_1#211
## quant_kallisto/19776_1#217
## quant_kallisto/19776_1#218
## quant_kallisto/19776_1#220
## quant_kallisto/19776_1#222
## quant_kallisto/19776_1#225
## quant_kallisto/19776_1#231
## quant_kallisto/19776_1#239
## quant_kallisto/19776_1#240
## quant_kallisto/19776_1#243
## quant_kallisto/19776_1#250
## quant_kallisto/19776_1#254
## quant_kallisto/19776_1#26
## quant_kallisto/19776_1#262
## quant_kallisto/19776_1#267
## quant_kallisto/19776_1#270
## quant_kallisto/19776_1#271
## quant_kallisto/19776_1#274
## quant_kallisto/19776_1#279
## quant_kallisto/19776_1#282
## quant_kallisto/19776_1#283
## quant_kallisto/19776_1#285
## quant_kallisto/19776_1#287
## quant_kallisto/19776_1#291
## quant_kallisto/19776_1#292
## quant_kallisto/19776_1#295
## quant_kallisto/19776_1#3
## quant_kallisto/19776_1#300
## quant_kallisto/19776_1#309
## quant_kallisto/19776_1#318
## quant_kallisto/19776_1#322
## quant_kallisto/19776_1#325
## quant_kallisto/19776_1#33
## quant_kallisto/19776_1#333
## quant_kallisto/19776_1#340
## quant_kallisto/19776_1#345
## quant_kallisto/19776_1#358
## quant_kallisto/19776_1#361
## quant_kallisto/19776_1#362
## quant_kallisto/19776_1#38
## quant_kallisto/19776_1#43
## quant_kallisto/19776_1#5
## quant_kallisto/19776_1#54
## quant_kallisto/19776_1#56
## quant_kallisto/19776_1#57
## quant_kallisto/19776_1#60
## quant_kallisto/19776_1#61
## quant_kallisto/19776_1#64
## quant_kallisto/19776_1#65
## quant_kallisto/19776_1#76
## quant_kallisto/19776_1#77
## quant_kallisto/19776_1#79
## quant_kallisto/19776_1#8
## quant_kallisto/19776_1#86
## quant_kallisto/19776_1#87
## quant_kallisto/19776_1#9
## quant_kallisto/19776_1#98
## quant_kallisto/19776_1#99
## quant_kallisto/19776_2#114
## quant_kallisto/19776_2#133
## quant_kallisto/19776_2#138
## quant_kallisto/19776_2#144
## quant_kallisto/19776_2#164
## quant_kallisto/19776_2#17
## quant_kallisto/19776_2#172
## quant_kallisto/19776_2#173
## quant_kallisto/19776_2#198
## quant_kallisto/19776_2#216
## quant_kallisto/19776_2#238
## quant_kallisto/19776_2#275
## quant_kallisto/19776_2#42
## Variables with highest loadings for PC1 and PC2:
## 
##                                            PC1          PC2
## ---------------------------------  -----------  -----------
## pct_counts_top_100_features          0.4287933   -0.2472655
## pct_counts_feature_controls          0.0782467   -0.7800732
## n_detected_feature_controls         -0.4194998   -0.3829379
## log10_counts_feature_controls       -0.4489378   -0.3407878
## total_features                      -0.4568373    0.2510491
## log10_counts_endogenous_features    -0.4730626    0.0673957

1.8 Standard PCA

Produce a standard PCA too.

plotPCA(sce_gene_filt, size_by = "total_features", 
        colour_by = "pct_counts_feature_controls_ERCC")

1.9 Correlation of PCs with QC metrics

Look at which PCs are most correlated with certain QC metrics, such as…

Total features

plotQC(sce_gene_filt, type = "find-pcs", 
       variable = "total_features") 

% counts from top 50 features

plotQC(sce_gene_filt, type = "find-pcs", 
       variable = "pct_counts_top_50_features") 

1.10 Plot explanatory variables

sce_gene_filt$start_time <- NULL
zero_var <- matrixStats::rowVars(exprs(sce_gene_filt)) == 0
plotQC(sce_gene_filt[!zero_var,], "expl",
       variables = c("pct_dropout", "total_features", 
                     "pct_counts_top_200_features", 
                     "pct_counts_feature_controls_ERCC", 
                     "pct_counts_feature_controls_MT", 
                     "n_detected_feature_controls",
                     "log10_counts_endogenous_features",
                     "log10_counts_feature_controls_ERCC",
                     "log10_counts_feature_controls_MT"))